This dataset collects information from more than 100,000 medical appointments in Brazil and focuses on whether or not patients show up for their appointments. This analysis explores factors that influence no-shows, with the goal of uncovering patterns that might explain why patients miss appointments and identifying areas for potential intervention. The dependent variable is no_show, which indicates whether a patient missed their scheduled appointment (Yes = did not show up, No = showed up).
We examine several independent variables, including:
Data columns:
In this analysis, my main question is how age, scholarship, and sms_received relate to no-shows. In addition, I will address two more questions that I think will add context and depth to the analysis:
Are there neighborhoods with low no-show rates that could serve as a model for underperforming neighborhoods, with respect to no-show rates?
This question builds on the first and serves as an alternate hypothesis: if none of the factors listed in question 1 has an effect, do chronic conditions affect show rates?
# import statements for all of the packages that will be used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.cm as cm
from scipy.stats import chi2_contingency
In this section of the report, I will load in the data, take a preliminary look, check for cleanliness, and then trim and clean the dataset for analysis.
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
df.head(2)
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
df['Handcap'].value_counts()
0    108286
1      2042
2       183
3        13
4         3
Name: Handcap, dtype: int64
df.columns
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
dtype='object')
df.value_counts()
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
3.921784e+04 5751990 F 2016-05-31T10:56:41Z 2016-06-03T00:00:00Z 44 PRAIA DO SUÁ 0 0 0 0 0 0 No 1
7.346244e+13 5719885 F 2016-05-19T12:40:47Z 2016-05-30T00:00:00Z 5 CENTRO 0 0 0 0 0 1 No 1
7.345683e+13 5764698 F 2016-06-02T10:45:50Z 2016-06-02T00:00:00Z 36 BONFIM 0 0 0 0 0 0 No 1
5726179 F 2016-05-20T13:14:48Z 2016-05-20T00:00:00Z 36 BONFIM 0 0 0 0 0 0 No 1
5686158 F 2016-05-11T12:13:04Z 2016-05-11T00:00:00Z 36 BONFIM 0 0 0 0 0 0 No 1
..
6.974936e+12 5674530 F 2016-05-09T11:13:07Z 2016-05-09T00:00:00Z 59 INHANGUETÁ 0 1 0 0 0 0 No 1
6.974857e+12 5644516 F 2016-05-02T08:46:25Z 2016-05-02T00:00:00Z 7 TABUAZEIRO 1 0 0 0 0 0 No 1
5643598 F 2016-05-02T07:38:03Z 2016-05-17T00:00:00Z 7 TABUAZEIRO 1 0 0 0 0 0 Yes 1
6.974625e+12 5707587 M 2016-05-17T09:49:31Z 2016-05-19T00:00:00Z 3 SANTA LUÍZA 0 0 0 0 0 0 No 1
9.999816e+14 5660958 F 2016-05-05T07:04:46Z 2016-05-05T00:00:00Z 1 SANTO ANTÔNIO 0 0 0 0 0 0 No 1
Length: 110527, dtype: int64
df.info() # take a look at data types, if there are any null values, memory usage
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PatientId 110527 non-null float64 1 AppointmentID 110527 non-null int64 2 Gender 110527 non-null object 3 ScheduledDay 110527 non-null object 4 AppointmentDay 110527 non-null object 5 Age 110527 non-null int64 6 Neighbourhood 110527 non-null object 7 Scholarship 110527 non-null int64 8 Hipertension 110527 non-null int64 9 Diabetes 110527 non-null int64 10 Alcoholism 110527 non-null int64 11 Handcap 110527 non-null int64 12 SMS_received 110527 non-null int64 13 No-show 110527 non-null object dtypes: float64(1), int64(8), object(5) memory usage: 11.8+ MB
# show summary for all columns. no need to worry about NaN values here, those show up for categorical variables, which I cover in the next cell.
round(df.describe(include = 'all', datetime_is_numeric=True),2)
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 110527.00 | 110527 | 110527 | 110527 | 110527.00 | 110527 | 110527.0 | 110527.0 | 110527.00 | 110527.00 | 110527.00 | 110527.00 | 110527 |
| unique | NaN | NaN | 2 | 103549 | 27 | NaN | 81 | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
| top | NaN | NaN | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | NaN | JARDIM CAMBURI | NaN | NaN | NaN | NaN | NaN | NaN | No |
| freq | NaN | NaN | 71840 | 24 | 4692 | NaN | 7717 | NaN | NaN | NaN | NaN | NaN | NaN | 88208 |
| mean | 1.474963e+14 | 5675305.12 | NaN | NaN | NaN | 37.09 | NaN | 0.1 | 0.2 | 0.07 | 0.03 | 0.02 | 0.32 | NaN |
| std | 2.560949e+14 | 71295.75 | NaN | NaN | NaN | 23.11 | NaN | 0.3 | 0.4 | 0.26 | 0.17 | 0.16 | 0.47 | NaN |
| min | 3.921784e+04 | 5030230.00 | NaN | NaN | NaN | -1.00 | NaN | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | NaN |
| 25% | 4.172614e+12 | 5640285.50 | NaN | NaN | NaN | 18.00 | NaN | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | NaN |
| 50% | 3.173184e+13 | 5680573.00 | NaN | NaN | NaN | 37.00 | NaN | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | NaN |
| 75% | 9.439172e+13 | 5725523.50 | NaN | NaN | NaN | 55.00 | NaN | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | NaN |
| max | 9.999816e+14 | 5790484.00 | NaN | NaN | NaN | 115.00 | NaN | 1.0 | 1.0 | 1.00 | 1.00 | 4.00 | 1.00 | NaN |
df.describe(include='object') # Categorical columns and their summary statistics
| | Gender | ScheduledDay | AppointmentDay | Neighbourhood | No-show |
|---|---|---|---|---|---|
| count | 110527 | 110527 | 110527 | 110527 | 110527 |
| unique | 2 | 103549 | 27 | 81 | 2 |
| top | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | JARDIM CAMBURI | No |
| freq | 71840 | 24 | 4692 | 7717 | 88208 |
df.shape #110527 rows of data with 14 columns
(110527, 14)
From the preceding cells we get a sense of how big our dataset is (110,527 rows and 14 columns), what kinds of values are stored in each record and if they'll need to be converted, deleted or addressed in some way, and we can start to get a sense of averages, minimums and maximums for some of the data contained within.
- Check for and remove null values to ensure completeness of the dataset.
- Identify and eliminate duplicate records to avoid skewed analysis.
- Inspect for unexpected values, such as an age of -1, and handle them appropriately.
- Detect and address outliers to ensure they don’t disproportionately influence results.
- Standardize categorical responses, like "Yes" vs "yes," for consistency in analysis.
- Convert date columns to DateTime format to enable easier manipulation and accurate time-based analysis.
- Optimize memory usage by converting object columns to categories where applicable.
- Trim leading/trailing spaces in string values and replace dashes with underscores for uniform column naming.
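The standardization and trimming items in this checklist could look like the following minimal sketch. The `responses` Series is made-up illustration data rather than values from the dataset, and `standardize_labels` is a hypothetical helper name.

```python
import pandas as pd

def standardize_labels(series):
    """Trim surrounding whitespace and normalize capitalization (e.g. ' yes ' -> 'Yes')."""
    return series.str.strip().str.capitalize()

# Hypothetical messy responses, for illustration only
responses = pd.Series([" Yes", "yes ", "NO", "No"])
cleaned = standardize_labels(responses)
print(cleaned.tolist())
```

Applying the same helper to every categorical column before grouping would keep variants such as "Yes" and "yes" from being counted as separate categories.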
def clean_column_names(df):
    """
    Cleans column names in a DataFrame:
    Converts to lowercase, replaces spaces and dashes with underscores.
    """
    df.columns = df.columns.str.lower().str.replace(' ', '_', regex=True).str.replace('-', '_', regex=True)
    return df
# Apply function to clean column names
df = clean_column_names(df)
# Check for missing values
print(df.isnull().sum())
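# Drop rows with missing values (none are present here, so this leaves the data unchanged)
df_cleaned = df.dropna()
# Fill any missing ages with the median; assigning the result back avoids chained inplace modification
df['age'] = df['age'].fillna(df['age'].median())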
patientid         0
appointmentid     0
gender            0
scheduledday      0
appointmentday    0
age               0
neighbourhood     0
scholarship       0
hipertension      0
diabetes          0
alcoholism        0
handcap           0
sms_received      0
no_show           0
dtype: int64
# optionally drop the appointmentid column, since it adds nothing to the analysis (left commented out here)
#df.drop(['appointmentid'], axis=1, inplace=True)
# note: once the column has been dropped, running this cell again raises a KeyError because the column no longer exists.
# renaming some columns
df.rename(columns={'hipertension': 'hypertension', 'handcap': 'handicap'},inplace=True)
# confirm changes
df.head(1)
| | patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | no_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296 | 5642903 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
# Check for duplicate rows
print(df.duplicated().sum())
# Drop duplicate rows
df = df.drop_duplicates()
0
# Find rows with invalid ages
invalid_ages = df[df['age'] < 0]
print(invalid_ages)
# Filter out rows with invalid ages
df = df[df['age'] >= 0]
Empty DataFrame Columns: [patientid, appointmentid, gender, scheduledday, appointmentday, age, neighbourhood, scholarship, hypertension, diabetes, alcoholism, handicap, sms_received, no_show] Index: []
# Remove the row where 'age' is -1
df = df[df['age'] != -1]
# Verify the change
print(df[df['age'] == -1]) # This confirms an empty DataFrame
Empty DataFrame Columns: [patientid, appointmentid, gender, scheduledday, appointmentday, age, neighbourhood, scholarship, hypertension, diabetes, alcoholism, handicap, sms_received, no_show] Index: []
I chose to remove this row rather than impute an average or guess at what was meant when it was entered. Since that single row is just 0.000905% of the dataset, removing it will not materially affect any results.
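The share quoted here can be checked with a line of arithmetic. This is a small sketch using the row count reported by df.shape earlier; the variable names are illustrative.

```python
# Share of the dataset represented by the single removed row
total_rows = 110527   # row count reported by df.shape before cleaning
removed_rows = 1      # the one record with age == -1

share_pct = removed_rows / total_rows * 100
print(f"{share_pct:.6f}% of rows removed")
```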
df.isnull().sum() #confirming there are no null values
patientid 0 appointmentid 0 gender 0 scheduledday 0 appointmentday 0 age 0 neighbourhood 0 scholarship 0 hypertension 0 diabetes 0 alcoholism 0 handicap 0 sms_received 0 no_show 0 dtype: int64
# Identify outliers using summary statistics
print(df['age'].describe())
# Visualize outliers and filter out extreme outliers (already fixed)
df = df[df['age'] <= 115] # Ensure no extreme ages remain
# Plot the boxplot with the updated syntax
plt.figure(figsize=(8, 6))
sns.boxplot(x='age', data=df)
plt.title('age Distribution Without Outliers', fontsize=16)
plt.xlabel('age', fontsize=14)
plt.show()
count    110526.000000
mean         37.089219
std          23.110026
min           0.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: age, dtype: float64
# Check data types
print(df.dtypes)
# Convert columns to appropriate types
df['age'] = df['age'].astype(int) # Example: Convert age to integer
df['appointmentday'] = pd.to_datetime(df['appointmentday']) # Convert dates
patientid int64 appointmentid int64 gender category scheduledday datetime64[ns, UTC] appointmentday datetime64[ns, UTC] age int32 neighbourhood category scholarship int64 hypertension int64 diabetes int64 alcoholism int64 handicap int64 sms_received int64 no_show category dtype: object
# Convert column with mixed types to a consistent type
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df.info() #confirm data type changes and memory usage.
<class 'pandas.core.frame.DataFrame'> Int64Index: 110526 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 patientid 110526 non-null int64 1 appointmentid 110526 non-null int64 2 gender 110526 non-null category 3 scheduledday 110526 non-null datetime64[ns, UTC] 4 appointmentday 110526 non-null datetime64[ns, UTC] 5 age 110526 non-null int32 6 neighbourhood 110526 non-null category 7 scholarship 110526 non-null int64 8 hypertension 110526 non-null int64 9 diabetes 110526 non-null int64 10 alcoholism 110526 non-null int64 11 handicap 110526 non-null int64 12 sms_received 110526 non-null int64 13 no_show 110526 non-null category dtypes: category(3), datetime64[ns, UTC](2), int32(1), int64(8) memory usage: 10.0 MB
The dataset used 11.8+ MB when it was first loaded. Though that's not extreme, I'm going to optimize it.
# Convert patientid to int64 to eliminate the scientific notation formatting
df['patientid'] = df['patientid'].astype('int64')
# Convert gender, neighbourhood, and no_show to categorical
df['gender'] = df['gender'].astype('category')
df['neighbourhood'] = df['neighbourhood'].astype('category')
df['no_show'] = df['no_show'].astype('category')
# Convert ScheduledDay to datetime64
df['scheduledday'] = pd.to_datetime(df['scheduledday'])
# Convert indicator columns to int8 (handicap ranges from 0 to 4, so bool would not fit it)
binary_columns = ['scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received']
df[binary_columns] = df[binary_columns].astype('int8')
# Check the reduced memory usage
print(df.info())
<class 'pandas.core.frame.DataFrame'> Int64Index: 110526 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 patientid 110526 non-null int64 1 appointmentid 110526 non-null int64 2 gender 110526 non-null category 3 scheduledday 110526 non-null datetime64[ns, UTC] 4 appointmentday 110526 non-null datetime64[ns, UTC] 5 age 110526 non-null int32 6 neighbourhood 110526 non-null category 7 scholarship 110526 non-null int8 8 hypertension 110526 non-null int8 9 diabetes 110526 non-null int8 10 alcoholism 110526 non-null int8 11 handicap 110526 non-null int8 12 sms_received 110526 non-null int8 13 no_show 110526 non-null category dtypes: category(3), datetime64[ns, UTC](2), int32(1), int64(2), int8(6) memory usage: 5.6 MB None
Reduced memory usage by over half.
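To see where savings like this come from, the sketch below measures memory before and after the same kinds of conversions on a small synthetic frame. The data here is a stand-in built for illustration, not the appointments dataset, and the name `demo` is an assumption. `memory_usage(deep=True)` is used because it also counts the contents of string columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset): a repeated label column and a 0/1 flag
n = 10_000
demo = pd.DataFrame({
    "neighbourhood": np.tile(["CENTRO", "BONFIM", "JARDIM CAMBURI", "TABUAZEIRO"], n // 4),
    "sms_received": np.tile([0, 1], n // 2),
})

before = demo.memory_usage(deep=True).sum()

# Repeated labels are stored once as categories; small integers fit in one byte each
demo["neighbourhood"] = demo["neighbourhood"].astype("category")
demo["sms_received"] = demo["sms_received"].astype("int8")

after = demo.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The category conversion pays off most when a column has few distinct values relative to its length, as neighbourhood does here with 81 labels across more than 110,000 rows.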
Age Distribution: A histogram visualized age differences for those who showed up versus no-shows.
Socioeconomic Factors: The impact of receiving a scholarship on no-show rates was evaluated through a bar plot.
SMS Notifications: We visualized the impact of sending SMS notifications.
Findings:
Age: Age does not show a strong trend in predicting no-shows, with rates being fairly consistent across age groups.
Scholarship: Patients on welfare (Scholarship = 1) have slightly higher no-show rates (23.7%) than those not on welfare (19.8%), suggesting socioeconomic status may play a role.
SMS Notifications: Patients who received an SMS reminder had a higher no-show rate (27.6%) compared to those who did not (16.7%). This counterintuitive result warrants further investigation, as it may reflect bias in who received SMS notifications.
The lack of strong trends in the above variables suggests that other factors (e.g., neighborhood or chronic conditions) may play a more significant role in predicting no-shows. A logistic regression model could be helpful for evaluating combined effects of variables.
Now that we have a clean dataset, I'm going to start getting to know the patients in it. I'll begin by looking at where they live and how many unique patients come from each neighborhood.
# Count of unique patientids per neighborhood
unique_patients_per_neighborhood = df.groupby('neighbourhood')['patientid'].nunique().sort_values(ascending=False)
# Plot a bar chart with refined x-axis labels for better readability
plt.figure(figsize=(20, 20))
unique_patients_per_neighborhood.plot(kind='bar', color='lightgreen')
plt.title('Number of Unique Patients per Neighborhood', fontsize=20)
plt.xlabel('Neighborhood', fontsize=16)
plt.ylabel('Number of Unique Patients', fontsize=16)
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels and align to the right for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout() # to ensure everything fits nicely within the plot area
plt.show()
# Generate a gradient color map
colors = cm.viridis(np.linspace(0, 1, len(unique_patients_per_neighborhood)))
plt.figure(figsize=(20, 20))
unique_patients_per_neighborhood.plot(
kind='bar',
color=colors
)
plt.title('Number of Unique Patients per Neighborhood', fontsize=20)
plt.xlabel('Neighborhood', fontsize=16)
plt.ylabel('Number of Unique Patients', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Highlight top 5 neighborhoods
top_n = 5
colors = ['gold' if i < top_n else 'lightgreen' for i in range(len(unique_patients_per_neighborhood))]
plt.figure(figsize=(20, 20))
unique_patients_per_neighborhood.plot(
kind='bar',
color=colors
)
plt.title('Number of Unique Patients per Neighborhood', fontsize=20)
plt.xlabel('Neighborhood', fontsize=16)
plt.ylabel('Number of Unique Patients', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
def add_weekday_column(df, date_col):
    """
    Adds a column with the weekday name corresponding to a given date column. This will be used with the neighborhood data.
    """
    df[f'{date_col}_weekday'] = pd.to_datetime(df[date_col]).dt.day_name()
    return df
# Apply function
df = add_weekday_column(df, 'appointmentday')
print(df[['appointmentday', 'appointmentday_weekday']].head())
             appointmentday appointmentday_weekday
0 2016-04-29 00:00:00+00:00                 Friday
1 2016-04-29 00:00:00+00:00                 Friday
2 2016-04-29 00:00:00+00:00                 Friday
3 2016-04-29 00:00:00+00:00                 Friday
4 2016-04-29 00:00:00+00:00                 Friday
With the weekday function we can more easily extract the day of the week from either date column, scheduledday or appointmentday. The function takes the name of a date column, extracts the weekday name using .dt.day_name(), and stores it in a new column named after the input column with a _weekday suffix (for example, appointmentday_weekday).
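As a quick standalone check of the weekday extraction, here is a minimal sketch that applies .dt.day_name() to two appointment dates that appear in this report's outputs. The variable names are illustrative.

```python
import pandas as pd

# Two appointment dates that appear in the outputs of this report
dates = pd.Series(pd.to_datetime(["2016-04-29T00:00:00Z", "2016-06-07T00:00:00Z"]))

# .dt.day_name() returns the English weekday name for each timestamp
weekdays = dates.dt.day_name()
print(weekdays.tolist())
```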
# Calling the function on the 'appointmentday' column
df = add_weekday_column(df, 'appointmentday')
# Displaying the first 5 rows to verify the new column
print(df[['appointmentday', 'appointmentday_weekday']].head())
appointmentday appointmentday_weekday 0 2016-04-29 00:00:00+00:00 Friday 1 2016-04-29 00:00:00+00:00 Friday 2 2016-04-29 00:00:00+00:00 Friday 3 2016-04-29 00:00:00+00:00 Friday 4 2016-04-29 00:00:00+00:00 Friday
# Adding weekday columns for both ScheduledDay and AppointmentDay
df = add_weekday_column(df, 'scheduledday')
df = add_weekday_column(df, 'appointmentday')
print(df[['scheduledday', 'scheduledday_weekday', 'appointmentday', 'appointmentday_weekday']].tail())
scheduledday scheduledday_weekday \
110522 2016-05-03 09:15:35+00:00 Tuesday
110523 2016-05-03 07:27:33+00:00 Tuesday
110524 2016-04-27 16:03:52+00:00 Wednesday
110525 2016-04-27 15:09:23+00:00 Wednesday
110526 2016-04-27 13:30:56+00:00 Wednesday
appointmentday appointmentday_weekday
110522 2016-06-07 00:00:00+00:00 Tuesday
110523 2016-06-07 00:00:00+00:00 Tuesday
110524 2016-06-07 00:00:00+00:00 Tuesday
110525 2016-06-07 00:00:00+00:00 Tuesday
110526 2016-06-07 00:00:00+00:00 Tuesday
# Subset specific neighborhoods of interest. cell hidden for future examination
selected_neighborhoods = ['ILHA DO PRINCIPE', 'PARQUE INDUSTRIAL']
# Filter data for the selected neighborhoods
subset = df[df['neighbourhood'].isin(selected_neighborhoods)]
# Calculate no-show rates by weekday for the selected neighborhoods
weekday_no_show = subset.groupby(['neighbourhood', 'appointmentday_weekday'])['no_show'].value_counts(normalize=True).unstack()
# Plot the data
weekday_no_show.plot(kind='bar', figsize=(10, 6), stacked=True, rot=45)
plt.title('No-show Rates by Day of Week for Selected Neighborhoods', fontsize=14)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Proportion of Appointments', fontsize=12)
plt.legend(title='No-show', labels=['Showed Up', 'Did Not Show'], loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
df.columns
Index(['patientid', 'appointmentid', 'gender', 'scheduledday',
'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hypertension',
'diabetes', 'alcoholism', 'handicap', 'sms_received', 'no_show',
'appointmentday_weekday', 'scheduledday_weekday'],
dtype='object')
df.head()
| | patientid | appointmentid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | no_show | appointmentday_weekday | scheduledday_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296 | 5642903 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday | Friday |
| 1 | 558997776694438 | 5642503 | M | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | Friday |
| 2 | 4262962299951 | 5642549 | F | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | Friday |
| 3 | 867951213174 | 5642828 | F | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday | Friday |
| 4 | 8841186448183 | 5642494 | F | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | Friday | Friday |
df['patientid'].nunique() # number of unique patients - there are 62,298 unique patient IDs
62298
# Group by no_show and summarize age (count, mean, spread, and quartiles)
age_analysis = df.groupby('no_show')['age'].describe()
print(age_analysis)
# Visualize age distribution
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='age', hue='no_show', kde=True, bins=30, palette='Set2')
plt.title('age Distribution by Show/no_show Status', fontsize=16)
plt.xlabel('age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='no_show', labels=['Showed Up', 'Did Not Show'])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
           count       mean        std  min   25%   50%   75%    max
no_show
No       88207.0  37.790504  23.338645  0.0  18.0  38.0  56.0  115.0
Yes      22319.0  34.317667  21.965941  0.0  16.0  33.0  51.0  115.0
The majority of patients are between the ages of 0 and 60. While younger children (ages 0-10) are highly represented among appointments, their no_show rates are roughly proportional to their presence.
There doesn't appear to be a strong correlation between age and no_show rates. However, younger children might show slightly higher variability, which could be due to reliance on their parents to arrive at appointments.
Age doesn't appear to be a significant predictor of no_show behavior, though young children and older adults might require additional context to confirm this.
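One way to put a number on statements like these is to compute the no_show rate within age bands. The sketch below does this on a tiny made-up sample (not rows from the dataset); the band edges and the names `sample`, `bands`, and `rate_by_band` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical mini-sample, for illustration only (not rows from the dataset)
sample = pd.DataFrame({
    "age":     [2, 7, 8, 12, 25, 31, 64, 70],
    "no_show": ["No", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Broad, assumed age bands; intervals are closed on the left, e.g. [0, 18)
bands = pd.cut(sample["age"], bins=[0, 18, 40, 120],
               labels=["0-17", "18-39", "40+"], right=False)

# The mean of a boolean column is the proportion of True values
missed = sample["no_show"].eq("Yes")
rate_by_band = missed.groupby(bands, observed=True).mean() * 100
print(rate_by_band.round(1))
```

Run against the full dataset, a table like this would show directly whether the rate is flat across ages or drifts with age.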
Next, I look at no_show rates from a slightly different angle: through the lenses of age and gender.
# examining no_show rates by gender using a Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(
data=df,
x='gender',
y='age',
hue='no_show',
split=True,
palette=['salmon', 'lightblue']
)
plt.title('age Distribution by gender and no_show Status', fontsize=16)
plt.xlabel('gender', fontsize=14)
plt.ylabel('age', fontsize=14)
plt.legend(title='no_show', labels=['Showed Up', 'Did Not Show'])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
For females (F), there’s a higher concentration of younger patients who showed up (light blue), with a smoother taper as age increases. However, no_shows (salmon) are relatively consistent across age groups. For males (M), the distribution appears slightly flatter, with fewer younger males showing up and a somewhat more consistent no_show rate across ages.
We believe a simple pie chart will illustrate these findings in a more intuitive way. See below.
The overlap of "Showed Up" and "Did Not Show" indicates that gender alone may not strongly differentiate between show/no_show status. Both genders have significant representation in the younger age range for "Showed Up," but the "Did Not Show" distribution is slightly wider, especially among older patients.
Insights:
Age seems to play a more noticeable role in no_show rates than gender. For instance, the age summary above shows that patients who missed appointments were somewhat younger on average (mean age 34.3) than those who attended (mean age 37.8).
# Group data by gender and no_show, count occurrences
gender_no_show_counts = df.groupby(['gender', 'no_show']).size().unstack()
# Create pie charts, which are easier for people to interpret than violin charts
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Plot pie chart for females
axes[0].pie(
gender_no_show_counts.loc['F'],
labels=['Showed Up', 'Did Not Show'],
autopct='%1.1f%%',
colors=['lightblue', 'salmon'],
startangle=90
)
axes[0].set_title('Female no_show Rates', fontsize=14)
# Plot pie chart for males
axes[1].pie(
gender_no_show_counts.loc['M'],
labels=['Showed Up', 'Did Not Show'],
autopct='%1.1f%%',
colors=['lightblue', 'salmon'],
startangle=90
)
axes[1].set_title('Male no_show Rates', fontsize=14)
plt.suptitle('no_show Rates by gender', fontsize=16)
plt.tight_layout()
plt.show()
These pie charts illustrate the no-show rates for male and female patients. Interestingly, the no-show rates are nearly identical for both genders, with males showing a slightly lower rate of no-shows (20.0%) compared to females (20.3%). This suggests that gender is not a significant factor influencing appointment attendance. Future analyses could explore other variables, such as age, socioeconomic status, or neighborhood, to uncover more impactful predictors of no-show behavior.
# Calculate no_show rate by scholarship status
scholarship_analysis = df.groupby(['scholarship', 'no_show']).size().unstack()
scholarship_analysis['no_show Rate (%)'] = (scholarship_analysis['Yes'] / scholarship_analysis.sum(axis=1)) * 100
print(scholarship_analysis)
# Bar chart to visualize
plt.figure(figsize=(8, 6))
scholarship_analysis['no_show Rate (%)'].plot(kind='bar', color=['salmon', 'teal'])
plt.title('no_show Rate by scholarship Status', fontsize=16)
plt.xlabel('scholarship Status (0 = No, 1 = Yes)', fontsize=14)
plt.ylabel('no_show Rate (%)', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
no_show         No    Yes  no_show Rate (%)
scholarship
0            79924  19741         19.807355
1             8283   2578         23.736304
Patients enrolled in the scholarship program (which likely is indicative of low-income status) have a slightly higher no_show rate (23.7%) compared to those not enrolled (19.8%). This difference may suggest that financial challenges or other socioeconomic factors influence attendance, but the difference isn't drastic.
Scholarship enrollment may correlate slightly with higher no_show rates, but other underlying factors (e.g., transportation, access, or burnout from excessive texting) might better explain this discrepancy.
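The rates discussed here follow from dividing missed appointments by all appointments in each group. The sketch below reproduces that arithmetic from the counts in the table printed above; the names `counts`, `rate_pct`, and `gap` are illustrative.

```python
import pandas as pd

# Counts copied from the table printed above
counts = pd.DataFrame(
    {"No": [79924, 8283], "Yes": [19741, 2578]},
    index=pd.Index([0, 1], name="scholarship"),
)

# No-show rate = missed appointments / all appointments in the group
rate_pct = counts["Yes"] / (counts["No"] + counts["Yes"]) * 100
gap = rate_pct.loc[1] - rate_pct.loc[0]

print(rate_pct.round(2))
print(f"Difference: {gap:.1f} percentage points")
```

Writing the denominator as an explicit sum of the two count columns keeps the calculation correct even if further columns are later added to the table.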
At first glance, sms reminders appear counterproductive: the data here shows they correlate with higher no_show rates. This could suggest an unintended effect of the notification system, or selection bias, where sms recipients inherently differ from non-recipients.
# Calculate no_show rate by sms_received
sms_analysis = df.groupby(['sms_received', 'no_show']).size().unstack()
sms_analysis['no_show Rate (%)'] = (sms_analysis['Yes'] / sms_analysis.sum(axis=1)) * 100
print(sms_analysis)
# Bar chart to visualize
plt.figure(figsize=(8, 6))
sms_analysis['no_show Rate (%)'].plot(kind='bar', color=['salmon', 'teal'])
plt.title('no_show Rate by sms Notification', fontsize=16)
plt.xlabel('sms Received (0 = No, 1 = Yes)', fontsize=14)
plt.ylabel('no_show Rate (%)', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
no_show          No    Yes  no_show Rate (%)
sms_received
0             62509  12535         16.703534
1             25698   9784         27.574545
Surprisingly, patients who received sms reminders had a significantly higher no_show rate (27.6%) compared to those who didn't receive reminders (16.7%). This might seem counterintuitive and warrants further investigation. One hypothesis could be that sms reminders highlight the option to cancel or disregard the appointment.
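A chi-square test of independence is one way to check whether a gap like this could plausibly be due to chance. The sketch below applies `chi2_contingency`, which is imported at the top of this report, to the counts from the table printed above. It tests association only, so it cannot separate an effect of the reminders from selection bias in who receives them.

```python
from scipy.stats import chi2_contingency

# Rows: sms_received = 0, 1; columns: showed up ("No"), missed ("Yes").
# Counts copied from the table printed above.
observed = [[62509, 12535],
            [25698, 9784]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

A very small p-value here would indicate that sms_received and no_show are not independent in this sample, which is consistent with the rate difference seen in the bar chart.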
# Group by age and count the number of patients for each age
age_counts = df.groupby('age')['patientid'].count().reset_index()
# Rename columns for better readability
age_counts.columns = ['age', 'Patient Count']
# Sort the DataFrame by age
age_counts = age_counts.sort_values(by='age', ascending=True)
# Display the result
print(age_counts)
     age  Patient Count
0      0           3539
1      1           2273
2      2           1618
3      3           1513
4      4           1299
..   ...            ...
98    98              6
99    99              1
100  100              4
101  102              2
102  115              5

[103 rows x 2 columns]
plt.figure(figsize=(12, 6))
plt.bar(age_counts['age'], age_counts['Patient Count'], color='skyblue')
plt.title('Patient Count by age', fontsize=16)
plt.xlabel('age', fontsize=14)
plt.ylabel('Patient Count', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
This is an interesting graph. The spike at age 0 suggests that infants are frequently brought in for doctor appointments.
But this isn't very readable; so I'll be breaking ages into 5-year ranges next.
df['age'].min() #remove
0
# Keep ages from 0 to 104. The erroneous age of -1 (a single record) was already removed above; the upper bound also excludes the five records with age 115 shown in the age counts.
df_cleaned = df[(df['age'] >= 0) & (df['age'] <= 104)].copy() # Use .copy() to ensure we're working on a copy.
# Create 5-year age ranges
bins = range(0, df_cleaned['age'].max() + 5, 5) # Create bins up to the max age in steps of 5
labels = [f"{i}-{i+4}" for i in bins[:-1]] # Generate labels for the ranges dynamically with a list comprehension
# Add a new column for the age ranges
df_cleaned.loc[:, 'age Range'] = pd.cut(df_cleaned['age'], bins=bins, labels=labels, right=False)
# Count the number of patients in each range
age_range_counts = df_cleaned['age Range'].value_counts().sort_index()
# Display the table
age_range_table = pd.DataFrame({'age Range': age_range_counts.index, 'Patient Count': age_range_counts.values})
print(age_range_table)
# Plot the age ranges as a bar chart
plt.figure(figsize=(12, 6))
age_range_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Patient Count by age Range', fontsize=16)
plt.xlabel('age Range', fontsize=14)
plt.ylabel('Patient Count', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
   age Range  Patient Count
0        0-4          10242
1        5-9           7233
2      10-14           5782
3      15-19           7154
4      20-24           6856
5      25-29           6843
6      30-34           7515
7      35-39           7656
8      40-44           6851
9      45-49           7358
10     50-54           8107
11     55-59           7756
12     60-64           6771
13     65-69           5105
14     70-74           3361
15     75-79           2573
16     80-84           1928
17     85-89           1018
18     90-94            347
19     95-99             59
20   100-104              6
Much better. It's easy to see that young children visit the doctor often, and that the number of appointments stays fairly steady across the age ranges until it drops off for older people.
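The binning relies on `pd.cut` with `right=False`, which makes each interval closed on the left, so an age of 5 falls into "5-9" rather than "0-4". The sketch below shows how boundary ages are assigned; the `ages` values are chosen purely to illustrate the edges.

```python
import pandas as pd

# Ages chosen to sit on the bin boundaries
ages = pd.Series([0, 4, 5, 9, 10])

bins = list(range(0, 20, 5))                    # [0, 5, 10, 15]
labels = [f"{i}-{i + 4}" for i in bins[:-1]]    # ['0-4', '5-9', '10-14']

# right=False makes each interval closed on the left: [0, 5), [5, 10), [10, 15)
age_range = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_range.astype(str).tolist())
```

With the default `right=True`, the intervals would instead be (0, 5], (5, 10], and so on, which would place age 5 in the first bin and leave age 0 unassigned unless `include_lowest=True` is also set.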
# put the total number of unique patients into a variable for use
total_unique_patients = df_cleaned['patientid'].nunique()
# Calculate unique patients for each neighborhood
unique_patients_per_neighborhood = df_cleaned.groupby('neighbourhood')['patientid'].nunique().sort_values(ascending=False)
# Sum the unique patients in the top 5 neighborhoods
top_5_total = unique_patients_per_neighborhood.head(5).sum()
# Calculate the percentage of patients from the top 5 neighborhoods
top_5_percentage = (top_5_total / total_unique_patients) * 100
print(f"{top_5_percentage:.2f}% of people come from the top 5 neighborhoods.")
23.18% of people come from the top 5 neighborhoods.
# a plot of the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], bins=30, kde=True, color='skyblue')
plt.title('age Distribution of Patients', fontsize=16)
plt.xlabel('age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
This histogram provides an overview of the age distribution of patients in the dataset. The data reveals that the largest patient group is children under the age of 10, with the number of patients generally decreasing as age increases. There is also a noticeable concentration of patients in the 30–60 age range, likely reflecting the active adult population. The sharp decline in patient numbers above age 70 is expected given lower population proportions in these age groups. This distribution can provide context for interpreting trends in no-show rates across different age groups.
Questions for future study: What strategies could healthcare providers use to implement targeted interventions based on these findings? Further analysis could explore whether age influences appointment attendance behavior.
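As a starting point for that follow-up, here is a minimal sketch of how no-show rates could be compared across age bands. The toy DataFrame, the band edges, and the names `age_band` and `age_summary` are my own assumptions for illustration; only the `age` and `no_show` column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned appointments table (values are made up).
toy = pd.DataFrame({
    "age": [2, 7, 15, 23, 23, 41, 58, 64, 70, 88],
    "no_show": ["No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Hypothetical age bands; right=False makes each band [low, high).
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Appointment count and no-show rate (%) for each band.
age_summary = (
    toy.assign(missed=toy["no_show"].eq("Yes"))
       .groupby("age_band", observed=False)["missed"]
       .agg(appointments="size", no_show_rate="mean")
)
age_summary["no_show_rate"] *= 100
print(age_summary)
```

Reporting the appointment count next to each rate makes it easier to tell a real difference between bands from noise in a small group.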
# Percentage of patients from Jardim Camburi
jardim_percentage = round((df['neighbourhood'].value_counts(normalize=True)['JARDIM CAMBURI'] * 100),1)
print(f"{jardim_percentage:.2f}% of patients came from Jardim Camburi.")
# Percentage of patients who were female
female_percentage = round((df['gender'].value_counts(normalize=True)['F'] * 100),1)
print(f"{female_percentage:.2f}% of patients were female.")
# Percentage of patients who did not show up
no_show_percentage = round((df['no_show'].value_counts(normalize=True)['Yes'] * 100),1)
print(f"{no_show_percentage:.2f}% of patients did not show up for their appointments.")
# Percentage of patients who showed up
show_percentage = round((df['no_show'].value_counts(normalize=True)['No'] * 100),1)
print(f"{show_percentage:.2f}% of patients showed up for their appointments.")
7.00% of patients came from Jardim Camburi.
65.00% of patients were female.
20.20% of patients did not show up for their appointments.
79.80% of patients showed up for their appointments.
df.hist(figsize=(10,10));
Taking a last look at distributions over the different column values, we see these histograms provide an overview of the distribution of key variables in the dataset. The age variable shows a high concentration of patients in the younger age groups, with a steady decline as age increases. For categorical variables such as scholarship, hypertension, diabetes, alcoholism, and sms_received, the data is heavily skewed, with the majority of binary values being 0 (indicating absence of the condition or feature). The handicap variable reveals that most patients have a handicap value of 0, with progressively fewer patients having higher handicap levels. We see the scheduledday and appointmentday histograms indicate the dataset is time-bounded, focusing on a specific period in 2016.
We explored neighborhood-level patterns by calculating and visualizing no-show rates for all neighborhoods.
Findings: Highest No-show Rates:
These neighborhoods exhibit significantly higher no-show rates compared to the dataset average (20%).
Lowest No-show Rates:
These neighborhoods show exemplary adherence, with no-show rates significantly below average.
These disparities highlight the potential role of localized factors (e.g., socioeconomic conditions, accessibility of healthcare facilities) in influencing no-show rates.
Neighborhood-based interventions, such as targeted SMS campaigns or transportation support, may help address high no-show rates in neighborhoods like Ilha do Príncipe and Parque Industrial. Further investigation into socioeconomic and logistical barriers in these areas is recommended.
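The neighborhood ranking described above can be sketched as follows. This is an illustration on made-up data, not the exact cell used for the findings; the `MIN_APPOINTMENTS` threshold and the names `rates` and `reliable` are assumptions I introduce so that neighborhoods with only a handful of appointments do not dominate either end of the ranking.

```python
import pandas as pd

# Toy data: neighborhood "C" has a single appointment, so its 100% rate is not meaningful.
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 6 + ["B"] * 4 + ["C"] * 1,
    "no_show": ["Yes", "No", "No", "No", "No", "No",
                "Yes", "Yes", "No", "No",
                "Yes"],
})

MIN_APPOINTMENTS = 3  # hypothetical cut-off; tune it to the real data

# Appointment count and no-show rate (%) per neighborhood.
rates = (
    toy.assign(missed=toy["no_show"].eq("Yes"))
       .groupby("neighbourhood")["missed"]
       .agg(appointments="size", no_show_rate="mean")
)
rates["no_show_rate"] = rates["no_show_rate"] * 100

# Keep only neighborhoods with enough appointments, highest rate first.
reliable = (
    rates[rates["appointments"] >= MIN_APPOINTMENTS]
    .sort_values("no_show_rate", ascending=False)
)
print(reliable)
```

Applying a minimum count before ranking is a design choice: without it, the extremes of the list tend to be the smallest neighborhoods.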
unique_patients_per_neighborhood = (
df.groupby('neighbourhood')['patientid'].nunique().sort_values(ascending=False)
).reset_index()
# Rename columns for better readability
unique_patients_per_neighborhood.columns = ['Neighborhood', 'Unique Patients']
# Create an interactive Plotly bar chart - Gentle Reader, hover your mouse over each bar for individualized information.
fig = px.bar(
unique_patients_per_neighborhood,
x='Neighborhood',
y='Unique Patients',
title='Number of Unique Patients per Neighborhood',
labels={'Neighborhood': 'Neighborhood', 'Unique Patients': 'Number of Unique Patients'},
text='Unique Patients'
)
# Customize layout
fig.update_traces(textposition='outside', marker_color='lightgreen')
fig.update_layout(
xaxis_tickangle=45, # Rotate x-axis labels
title_font_size=20,
xaxis_title_font_size=16,
yaxis_title_font_size=16,
template='plotly_white',
height=800, # Adjust height
width=1200 # Adjust width
)
# Show the figure
fig.show()
Hover your mouse over a bar in this interactive chart, which shows the number of individual patients in each neighborhood. A further study could look at what proportion of each neighborhood's residents are patients, since larger neighborhoods may have more patients simply because more people live there. With that ratio calculated for each neighborhood, those with the highest proportion of residents seeking medical care could be studied further.
# Prepare data: Calculate unique patients, average age, and no_show rate for each neighborhood to be shown in tooltip
neighborhood_stats = df.groupby('neighbourhood').agg(
unique_patients=('patientid', 'nunique'),
avg_age=('age', 'mean'),
no_show_rate=('no_show', lambda x: (x == 'Yes').mean() * 100)
).reset_index()
# Rename some columns for better readability
neighborhood_stats.columns = ['Neighborhood', 'Unique Patients', 'Average age', 'no_show Rate (%)']
# Create a Plotly bar chart with enhanced tooltip
fig = px.bar(
neighborhood_stats,
x='Neighborhood',
y='Unique Patients',
title='Number of Unique Patients per Neighborhood',
labels={'Neighborhood': 'Neighborhood', 'Unique Patients': 'Number of Unique Patients'},
text='Unique Patients', # Display values on the bars
hover_data={'Average age': ':.1f', 'no_show Rate (%)': ':.1f'} # Format additional tooltip data
)
# Customize layout of bar graph
fig.update_traces(textposition='outside', marker_color='lightgreen')
fig.update_layout(
xaxis_tickangle=45,
title_font_size=20,
xaxis_title_font_size=16,
yaxis_title_font_size=16,
template='plotly_white',
height=800,
width=1200
)
# Show the figure
fig.show()
Hover your mouse over this interactive chart to see the average age and rate of no_shows in addition to the number of patients each neighborhood is reporting. Remember, the average age for the entire dataset is 37 years of age.
The data shows that these neighborhoods have exceptionally low no_show rates, with Enseada do Suá having one of the lowest (~11.5%), followed by other neighborhoods such as Santa Cecília. This suggests these areas could serve as a model for improving appointment adherence in higher no_show areas.
To evaluate the effect of chronic conditions, we analyzed no-show rates across hypertension, diabetes, alcoholism, and handicap level.
Findings: Hypertension, Diabetes, Alcoholism: Patients with hypertension or diabetes miss slightly fewer appointments than patients without those conditions (about 17% vs. 21% and 18% vs. 20%), while alcoholism shows no difference.
Handicap: A possible trend is observed:
Patients with handicap levels 3 and 4 have higher no-show rates (~23% and ~33%) than patients with lower levels or no handicap, but these two levels together cover only 16 appointments.
A chi-square test on handicap level (with levels 3 and 4 combined) did not reach statistical significance (p ≈ 0.076). The pattern is consistent with patients with severe disabilities facing barriers to attending appointments, but there are too few appointments at those levels to confirm it. If it holds, potential solutions include offering transportation assistance or scheduling flexibility for patients with higher handicap levels.
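# the highest correlation is for hypertension and age, hypertension and diabetes, and diabetes and age
# numeric_only=True skips the text and date columns, which newer pandas versions no longer drop automatically
df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm') # diverging colormap; 'cividis' is an alternative optimized for colorblind viewers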
| | patientid | appointmentid | age | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received |
|---|---|---|---|---|---|---|---|---|---|
| patientid | 1.000000 | 0.004023 | -0.004121 | -0.002877 | -0.006436 | 0.001608 | 0.011014 | -0.007915 | -0.009742 |
| appointmentid | 0.004023 | 1.000000 | -0.019106 | 0.022619 | 0.012759 | 0.022632 | 0.032946 | 0.014107 | -0.256613 |
| age | -0.004121 | -0.019106 | 1.000000 | -0.092463 | 0.504586 | 0.292391 | 0.095810 | 0.078032 | 0.012633 |
| scholarship | -0.002877 | 0.022619 | -0.092463 | 1.000000 | -0.019730 | -0.024894 | 0.035022 | -0.008587 | 0.001192 |
| hypertension | -0.006436 | 0.012759 | 0.504586 | -0.019730 | 1.000000 | 0.433085 | 0.087970 | 0.080083 | -0.006270 |
| diabetes | 0.001608 | 0.022632 | 0.292391 | -0.024894 | 0.433085 | 1.000000 | 0.018473 | 0.057530 | -0.014552 |
| alcoholism | 0.011014 | 0.032946 | 0.095810 | 0.035022 | 0.087970 | 0.018473 | 1.000000 | 0.004647 | -0.026149 |
| handicap | -0.007915 | 0.014107 | 0.078032 | -0.008587 | 0.080083 | 0.057530 | 0.004647 | 1.000000 | -0.024162 |
| sms_received | -0.009742 | -0.256613 | 0.012633 | 0.001192 | -0.006270 | -0.014552 | -0.026149 | -0.024162 | 1.000000 |
Looking at this correlation table, we see of course a diagonal of perfect 1s, as each column correlates with itself. Of note in the other cells, we see a moderate positive correlation (.50) between hypertension and age, which makes sense. Similarly, there is a weaker but still positive correlation (.29) between diabetes and age. There's a moderate positive correlation (.43) between hypertension and diabetes as well, which aligns with medical studies showing comorbidity between these conditions.
Alcoholism shows very weak correlations (<.1) with all other variables, indicating that, in this dataset at least, alcoholism doesn't relate strongly to other features in this population.
Of note, the correlation between SMS notification and the other patient variables is very weak (below .05 in absolute value). The only larger value (-.26) is with appointmentid, which is an identifier rather than a patient attribute.
Scholarship has weak negative correlations with variables like age (-.09) and hypertension (-.02), suggesting that younger individuals may be slightly more likely to have scholarships, possibly reflecting some income demographics.
Overall, the strongest relationships are between hypertension and age, hypertension and diabetes and diabetes and age. Understanding how these conditions might affect no-show rates may yield some benefit.
Variables like SMS notifications, alcoholism and scholarship show minimal correlation with other factors, indicating they may act more independently in this dataset.
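The correlation table above leaves out no_show because it is stored as text. A minimal sketch of how it could be brought in is shown below: map it to 0/1 and read off its column of the correlation matrix. The toy DataFrame and the names `no_show_flag` and `corr_with_no_show` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "age": [5, 30, 45, 62, 70, 81],
    "hypertension": [0, 0, 0, 1, 1, 1],
    "sms_received": [0, 1, 1, 0, 1, 0],
    "no_show": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Encode the target as 1 = missed appointment, 0 = attended.
toy["no_show_flag"] = toy["no_show"].map({"No": 0, "Yes": 1})

# Pearson correlation of every numeric column with the encoded target.
corr_with_no_show = (
    toy.corr(numeric_only=True)["no_show_flag"]
       .drop("no_show_flag")
       .sort_values()
)
print(corr_with_no_show)
```

For two 0/1 columns this Pearson value is the phi coefficient, so it is a reasonable first screen, though it says nothing about non-linear patterns.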
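# age hypertension scatter with line fit
# there are more people with hypertension among older patients, as correlation of 0.5 from above suggests
# share of patients with hypertension at each age; the groupby result is sorted by age
hypertension_mean = df.groupby('age')['hypertension'].mean()
# take the ages from the same index so every x value lines up with its y value
# (df['age'].unique() is in order of appearance and would not match the sorted means)
ages = hypertension_mean.index.to_numpy()
sns.regplot(x = ages, y = hypertension_mean.to_numpy())
plt.xlabel('age')
plt.ylabel('share of patients with hypertension')
plt.show()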
This scatter plot with a regression line shows the relationship between age and the likelihood of having hypertension. The positive slope of the regression line indicates that older patients are more likely to have hypertension, compared to younger ones. The data points display a moderate spread around the regression line, suggesting a noticeable but not perfect correlation. This aligns with the earlier calculation of a correlation coefficient of approximately 0.5. This finding reinforces the understanding that age is a significant factor in the prevalence of hypertension, likely due to physiological changes and cumulative risk factors associated with aging.
chronic_conditions = ['hypertension', 'diabetes', 'alcoholism', 'handicap']
for condition in chronic_conditions:
    condition_stats = df.groupby(condition)['no_show'].value_counts(normalize=True).unstack() * 100
    print(f"No-show rates for {condition}:")
    print(condition_stats)
    print()
No-show rates for hypertension:
No Yes
hypertension
0 79.096083 20.903917
1 82.698041 17.301959
No-show rates for diabetes:
No Yes
diabetes
0 79.636977 20.363023
1 81.996727 18.003273
No-show rates for alcoholism:
No Yes
alcoholism
0 79.805162 20.194838
1 79.851190 20.148810
No-show rates for handicap:
No Yes
handicap
0 79.764510 20.235490
1 82.076396 17.923604
2 79.781421 20.218579
3 76.923077 23.076923
4 66.666667 33.333333
This table provides a breakdown of no-show rates (proportions of patients who did or did not show up for their appointments) for different chronic conditions: hypertension, diabetes, alcoholism, and handicap.
Patients without hypertension (0) had a no-show rate of 20.90%, while patients with hypertension (1) had a slightly lower rate of 17.30%. This suggests that hypertensive patients may have slightly better adherence to appointments.
Patients without diabetes (0) had a no-show rate of 20.36%, while patients with diabetes (1) had a slightly lower rate of 18.00%. Diabetic patients also seem to show a slight improvement in appointment adherence compared to non-diabetic patients.
Patients without alcoholism (0) and those with alcoholism (1) have nearly identical no-show rates (20.19% vs. 20.15%). This suggests that alcoholism does not influence appointment attendance in this dataset.
No-show rates do not rise steadily with handicap level: patients with no handicap (0) had a no-show rate of 20.24%, level 1 was lower at 17.92%, and level 2 was almost the same as level 0 at 20.22%. Only the two highest levels stand out, at 23.08% for level 3 and 33.33% for level 4. Those two levels contain very few appointments, so the apparent barrier for severely handicapped individuals should be treated as a hypothesis rather than a firm result.
This data shows that hypertension and diabetes go with slightly better adherence, while alcoholism has no discernible impact. The highest handicap levels show higher no-show rates, which makes them a candidate for targeted intervention, provided the pattern holds up in a larger sample.
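To get a feel for how much the small groups can be trusted, here is a sketch that attaches a 95% Wilson confidence interval to each handicap group's no-show rate. The counts are copied from the contingency table printed later in this notebook (levels 3 and 4 combined); the helper `wilson_interval` is my own illustrative function, not part of pandas or SciPy.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# (showed up, did not show) counts per handicap group, from the contingency table.
counts = {
    "0": (86373, 21912),
    "1": (1676, 366),
    "2": (146, 37),
    "3-4": (12, 4),
}

intervals = {}
for level, (showed, missed) in counts.items():
    n = showed + missed
    low, high = wilson_interval(missed, n)
    intervals[level] = (low, high)
    print(f"handicap {level}: n={n}, no-show {missed / n:.1%}, "
          f"95% CI {low:.1%} to {high:.1%}")
```

The interval for the combined 3-4 group is very wide because it rests on only 16 appointments, which is why the rates for those levels need cautious reading.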
chronic_conditions = ['hypertension', 'diabetes', 'alcoholism', 'handicap']
# Initialize a figure for subplots
plt.figure(figsize=(16, 8))
# Iterate through each condition to create barplots
for i, condition in enumerate(chronic_conditions, 1):
    # Calculate the proportion of no-shows for each condition
    no_show_rates = df.groupby(condition)['no_show'].value_counts(normalize=True).unstack() * 100
    # Plot the proportion of 'Yes' (No-shows)
    plt.subplot(1, len(chronic_conditions), i)
    no_show_rates['Yes'].plot(kind='bar', color=['lightblue', 'salmon'], edgecolor='black')
    plt.title(f"No-show Rate by {condition.capitalize()}", fontsize=14)
    plt.ylabel("No-show Rate (%)")
    plt.xlabel(condition.capitalize())
    plt.ylim(0, 50) # Adjust y-axis for better visualization
# Adjust layout and show the plot
plt.tight_layout()
plt.show()
Based on these bar plots, the data shows patients with hypertension have a slightly lower no-show rate than those without hypertension, and the same is true to a smaller degree for diabetes. Patients with alcoholism have almost exactly the same no-show rate as those without it, suggesting alcoholism does not influence whether a patient shows up for their appointment. In the handicap panel, with its alternating light blue and salmon bars, the rate dips at level 1, returns to the baseline at level 2, and then rises at levels 3 and 4. Patients with the highest handicap scores show higher no-show rates (about 23% at level 3 and 33% at level 4). This could indicate accessibility issues, transportation challenges, or other barriers faced by patients with severe handicaps. Let's explore this more in the next graph.
The most notable pattern is the higher no-show rate at the highest handicap levels. This bears further investigation, to be sure. Hypertension and diabetes are associated with slightly lower no-show rates, while alcoholism seems to have little to no effect on appointment attendance. Further investigation into the challenges faced by handicapped patients might reveal actionable insights to reduce their no-show rates.
chronic_conditions = ['hypertension', 'diabetes', 'alcoholism', 'handicap']
# Create a stacked bar plot
plt.figure(figsize=(16, 6))
# Iterate through each chronic condition and create subplots
for i, condition in enumerate(chronic_conditions, 1):
    # Group data for the specific condition
    grouped = df.groupby([condition, 'no_show']).size().unstack(fill_value=0)
    proportions = grouped.div(grouped.sum(axis=1), axis=0) * 100
    # Create subplot for the condition
    plt.subplot(1, len(chronic_conditions), i)
    proportions.plot(
        kind='bar',
        stacked=True,
        color=['blue', 'salmon'],
        ax=plt.gca(),
        legend=False
    )
    # Customize the subplot
    plt.title(f"No-show Rate by {condition.capitalize()}", fontsize=14)
    plt.xlabel(condition.capitalize(), fontsize=12)
    plt.ylabel("Proportion (%)", fontsize=12)
    plt.ylim(0, 100)
# Add a legend outside the subplots
plt.legend(['Showed Up', 'Did Not Show'], loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
# Adjust layout and display the plot
plt.tight_layout()
plt.show()
Looking at the chronic conditions through a stacked bar graph, it's easier to see the corresponding rise in no-show rates to the increasing levels of the handicap column. I'd like to see if neighborhoods with higher handicap rates (3 and 4, since those show the starkest changes) also show higher no-show rates.
# Combine Handicap levels 3 and 4 to filter out noise
df['handicap_combined'] = df['handicap'].replace({3: '3-4', 4: '3-4'})
# Create a contingency table
contingency_table_combined = pd.crosstab(df['handicap_combined'], df['no_show'])
print("Contingency Table (Combined):")
print(contingency_table_combined)
# Chi-square test with combined levels
chi2_combined, p_combined, dof_combined, expected_combined = chi2_contingency(contingency_table_combined)
print("\nChi-Square Test Results (Combined):")
print(f"Chi-Square Statistic: {chi2_combined}")
print(f"P-Value: {p_combined}")
print(f"Degrees of Freedom: {dof_combined}")
if p_combined < 0.05:
    print("\nConclusion: There is a significant association between combined handicap levels and no-show rates.")
else:
    print("\nConclusion: There is no significant association between combined handicap levels and no-show rates.")
Contingency Table (Combined):
no_show               No    Yes
handicap_combined
0                  86373  21912
1                   1676    366
2                    146     37
3-4                   12      4

Chi-Square Test Results (Combined):
Chi-Square Statistic: 6.876485046567471
P-Value: 0.07594056832040967
Degrees of Freedom: 3

Conclusion: There is no significant association between combined handicap levels and no-show rates.
Patients with higher handicap levels (combined 3 and 4) missed 4 of their 16 appointments (25%), compared with about 20% for patients with no handicap. The chi-square test gives a p-value of about 0.076, which is above the usual significance threshold of 0.05, so we cannot reject the hypothesis that handicap level and no-show behavior are independent. Combining levels 3 and 4 was meant to reduce the noise from very low individual counts, but even the combined group is too small for a reliable test. The observed difference is therefore suggestive rather than conclusive. It is still plausible that patients with more severe handicaps face barriers that lead to no-shows, and healthcare facilities may want to look into this, but confirming it would take more data on these patients.
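As a cross-check on the test above, the sketch below re-runs the chi-square test directly from the printed contingency table, looks at the smallest expected cell count, and adds a Fisher exact test comparing levels 3-4 against everyone else. The Fisher test is my own addition, not part of the original analysis; it is a common alternative when expected counts are small. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts copied from the contingency table printed above
# (rows: handicap 0, 1, 2, 3-4; columns: showed up, did not show).
observed = np.array([
    [86373, 21912],
    [1676, 366],
    [146, 37],
    [12, 4],
])

# Same chi-square test as in the notebook, run on the raw counts.
chi2, p_value, dof, expected = chi2_contingency(observed)
smallest_expected = expected.min()
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
print(f"smallest expected count: {smallest_expected:.2f}")

# 2x2 table: handicap levels 3-4 versus all other patients.
rest = observed[:3].sum(axis=0)
two_by_two = np.array([observed[3], rest])
odds_ratio, p_fisher = fisher_exact(two_by_two)
print(f"Fisher exact test (3-4 vs. rest): p={p_fisher:.3f}")
```

A common rule of thumb is that the chi-square approximation becomes unreliable when an expected count falls below 5, which is the case for the "3-4 / did not show" cell here.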
This analysis explored the factors influencing patient no-show rates for medical appointments, focusing on three research questions:
Key Findings:
Handicap Levels and No-show Rates:
Patients at the highest handicap levels (3 and 4) had the highest no-show rates in the data (about 23% and 33%, or 25% when the two levels are combined), compared with about 20% for patients with no handicap. However, these levels cover only 16 appointments, and the chi-square test did not find a statistically significant association (p ≈ 0.076). Handicap severity may play a role in no-show behavior, and targeted interventions could have a meaningful impact if it does, but this dataset alone cannot confirm it.
Actionable Insights: Healthcare providers could explore strategies like improved accessibility, enhanced communication, or specialized support for patients with severe disabilities to reduce no-show rates.
Age and No-show Rates:
Age does not appear to be a strong predictor of no-show behavior. While younger children (ages 0-10) and older adults (60+) are proportionally represented in the data, their no-show rates are relatively consistent with other age groups. Slight variability was observed among young children, likely due to reliance on parental availability, but this finding was not statistically significant. Further context (e.g., family dynamics or transportation availability) might be needed to confirm these observations.
Scholarship Status and No-show Rates:
Patients enrolled in the scholarship program, a potential proxy for low-income status, had a slightly higher no-show rate (23.7%) compared to those not enrolled (19.8%). While the difference is not drastic, it may suggest financial or socioeconomic barriers influencing attendance.
Recommendation: Programs aimed at mitigating financial challenges or addressing socioeconomic barriers could help reduce no-show rates among scholarship recipients.
SMS Notifications and No-show Rates:
Counterintuitively, patients who received SMS reminders had a higher no-show rate (27.6%) compared to those who did not receive reminders (16.7%). This suggests unintended effects of the SMS notification system, such as inadvertently providing patients with an option to cancel or disregard their appointments. Alternatively, it could reflect selection bias, where SMS recipients may inherently differ from non-recipients. Recommendation: Revisiting the SMS notification system is critical to ensure it incentivizes attendance. Testing alternative communication strategies, such as personalized follow-up calls or app-based notifications, could help address this issue.
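One way to probe the selection-bias explanation is to compare SMS recipients and non-recipients within groups that booked a similar number of days ahead. The sketch below assumes that booking lead time is a relevant difference between the two groups, which this notebook has not verified. The toy values, the band edges, and the names `lead_days`, `lead_band`, and `by_band` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "scheduledday": pd.to_datetime(["2016-05-02 09:30"] * 8),
    "appointmentday": pd.to_datetime([
        "2016-05-02", "2016-05-02", "2016-05-05", "2016-05-05",
        "2016-05-06", "2016-05-20", "2016-05-20", "2016-05-25",
    ]),
    "sms_received": [0, 0, 0, 1, 1, 1, 1, 0],
    "no_show": ["No", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
})

# Whole days between booking and appointment (time of day is dropped first).
toy["lead_days"] = (
    toy["appointmentday"].dt.normalize() - toy["scheduledday"].dt.normalize()
).dt.days

# Hypothetical lead-time bands: same day, up to a week, longer.
toy["lead_band"] = pd.cut(
    toy["lead_days"], bins=[-1, 0, 7, 365],
    labels=["same day", "1-7 days", "8+ days"],
)

# No-show rate by SMS status inside each lead-time band.
by_band = (
    toy.assign(missed=toy["no_show"].eq("Yes"))
       .groupby(["lead_band", "sms_received"], observed=True)["missed"]
       .agg(appointments="size", no_show_rate="mean")
)
print(by_band)
```

If the gap between SMS recipients and non-recipients shrinks or reverses inside each band on the real data, that would point toward selection rather than an effect of the reminders themselves.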
Neighborhood and No-show Rates:
Neighborhoods exhibited stark differences in no-show rates. Ilha do Príncipe (53%) and Parque Industrial (43%) had the highest rates, likely due to socioeconomic or logistical challenges. Conversely, Enseada do Suá (11.5%) and Santa Cecília (13.2%) had the lowest rates, suggesting these areas may offer lessons in successful patient outreach or access. Recommendation: Targeting high-risk neighborhoods with tailored interventions, such as transportation assistance or community-based reminders, could help address structural barriers to attendance.
Limitations:
Future Research Directions:
Summary: In this analysis we have identified actionable insights into patient no-show behavior, with a particular focus on handicap levels, neighborhood disparities, and the effectiveness of reminder systems. While limitations remain, these findings offer a foundation for targeted strategies to improve healthcare access and reduce missed appointments. Collaboration with local healthcare providers and policymakers will be essential to translate these insights into meaningful change.
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Marcy_Misner_Investigate_Dataset.ipynb
[NbConvertApp] Converting notebook Marcy_Misner_Investigate_Dataset.ipynb to html
[NbConvertApp] Writing 5362978 bytes to Marcy_Misner_Investigate_Dataset.html